Unit 2 Module 1
Spatial Data Analysis with R (01:450:320)
1 Working with spatial vector data
In this section we will start to work with spatial vector data,
learning the most common forms of vector-based spatial operations. We
will be working primarily with the sf package, which
replaces the earlier sp package.
The material in this section assumes that the reader is familiar with standard GIS operations and concepts, ranging from projections and transformations to intersections, buffers, and differences.
2 Preparation
2.1 Set-up
Let’s start by loading sdadata.
sdadata imports sf. sf depends
on the following external libraries (from here):
SystemRequirements: GDAL (>= 2.0.1), GEOS (>= 3.4.0), PROJ (>= 4.8.0), sqlite3
These are external libraries that provide key geospatial capabilities
that are used by R and many other languages and platforms
that can be used for geospatial analytics. If you are using the docker
container, these are already installed so you don’t have to worry about
them. Read more about them by following these links: GDAL, GEOS, PROJ, sqlite3
You should already have your class Rstudio project (e.g. ls320)
setup, with a notebooks/ folder within it. If you don’t
already have it, create a data/ folder within
notebooks/, which is where you will practice writing
spatial data files.
3 Reading, writing, and making vector data
This reading introduces reading, writing, and making vector data, how
to work with their coordinate reference systems (CRS), and how to do
some basic plotting, leavened with some additional R
programming examples. However, the first thing we need to do is learn
about simple features, which is the representation of vector data that
is employed by sf (which stands for simple features).
3.1 Simple features, explained
To learn more about what simple features are and how they work, and
why R developed a package for them, please read the
sf introductory vignette
written by Edzer Pebesma, the primary sf package
developer.
3.2 From non-spatial to spatial
We are going to start by reading in the species dataset,
which installs with sdadata. It contains occurrence records
for the “Big Five” (Elephant, Lion, Leopard, Buffalo, and Rhinoceros) in
Tanzania, sourced from the Global
Biodiversity Information Facility (GBIF). This dataset is a subset
that contains the following variables:
- family: The biological family to which the species belongs (e.g., Felidae or Elephantidae).
- species: The scientific name of the animal (e.g., Panthera leo).
- month: The month of the year the observation was recorded (1−12)
- year: The year of the observation, allowing for analysis of trends over time.
- x: Longitude in decimal degrees.
- y: Latitude in decimal degrees.
Here’s a quick look:
# Chunk 1
species <- read_csv(system.file("extdata/species.csv", package = "sdadata"))
#
# 1
set.seed(30)
species %>%
sample_n(5) %>%
arrange(year, month)
#> # A tibble: 5 × 6
#> family species month year x y
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Felidae Panthera leo 10 2018 34.9 -2.42
#> 2 Elephantidae Loxodonta africana 1 2023 36.1 -4.05
#> 3 Felidae Panthera leo 1 2024 36.1 -3.78
#> 4 Elephantidae Loxodonta africana 8 2024 36.1 -3.67
#> 5 Elephantidae Loxodonta africana 3 2025 34.9 -2.60
#
# #2
species %>%
distinct(species) %>%
count()
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 5
#
# #3
species %>%
group_by(species, year) %>%
count() %>%
arrange(year)
#> # A tibble: 133 × 3
#> # Groups: species, year [133]
#> species year n
#> <chr> <dbl> <int>
#> 1 Loxodonta africana 2000 4
#> 2 Panthera leo 2000 2
#> 3 Panthera pardus 2000 2
#> 4 Syncerus caffer 2000 4
#> 5 Loxodonta africana 2001 5
#> 6 Panthera leo 2001 9
#> 7 Panthera pardus 2001 2
#> 8 Syncerus caffer 2001 2
#> 9 Diceros bicornis 2002 1
#> 10 Loxodonta africana 2002 4
#> # ℹ 123 more rowsOkay, #1 shows us that this dataset, read in as species,
is a tibble, which means it is non-spatial at the moment,
even though its comes with coordinates. #2 shows us that there are 5
species (big five) in the dataset, and #3 shows a varying number of
observations per species per year.
Given that these are point data, we can turn them into a facsimile of spatial data by simply plotting them and having a look.
To avoid possible duplicate points, we first cut species
down to just unique species records and coordinates (GBIF data is
messy), and then use gglot to shows us, in a very crude
way, where these points lie in space. This works, because
species$x (the longitude) is exactly analogous to the x
coordinate of a scatter plot, and species$y (latitude) is
the y coordinate. But it is still not a spatial object. So, let’s
convert it to one, and then plot it again.
# Chunk 3
# #1
species <- st_as_sf(species, coords = c("x", "y"))
species
#> Simple feature collection with 6951 features and 4 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 29.91544 ymin: -11.09595 xmax: 38.79323 ymax: -1.44546
#> CRS: NA
#> # A tibble: 6,951 × 5
#> family species month year geometry
#> <chr> <chr> <dbl> <dbl> <POINT>
#> 1 Elephantidae Loxodonta africana 6 2012 (34.08678 -2.276964)
#> 2 Bovidae Syncerus caffer 6 2012 (35.75882 -3.5614)
#> 3 Bovidae Syncerus caffer 6 2012 (35.44965 -3.230564)
#> 4 Felidae Panthera pardus 6 2012 (34.96447 -2.475424)
#> 5 Elephantidae Loxodonta africana 6 2012 (35.13262 -1.883995)
#> 6 Bovidae Syncerus caffer 6 2012 (35.73392 -3.0035)
#> 7 Elephantidae Loxodonta africana 6 2012 (35.61116 -3.004654)
#> 8 Elephantidae Loxodonta africana 6 2012 (35.97219 -3.429279)
#> 9 Bovidae Syncerus caffer 6 2012 (35.12084 -1.948731)
#> 10 Elephantidae Loxodonta africana 6 2012 (35.03655 -1.983431)
#> # ℹ 6,941 more rows
#
# #2
str(species)
#> sf [6,951 × 5] (S3: sf/tbl_df/tbl/data.frame)
#> $ family : chr [1:6951] "Elephantidae" "Bovidae" "Bovidae" "Felidae" ...
#> $ species : chr [1:6951] "Loxodonta africana" "Syncerus caffer" "Syncerus caffer" "Panthera pardus" ...
#> $ month : num [1:6951] 6 6 6 6 6 6 6 6 6 6 ...
#> $ year : num [1:6951] 2012 2012 2012 2012 2012 ...
#> $ geometry:sfc_POINT of length 6951; first list element: 'XY' num [1:2] 34.09 -2.28
#> - attr(*, "sf_column")= chr "geometry"
#> - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA
#> ..- attr(*, "names")= chr [1:4] "family" "species" "month" "year"
#
# #2
plot(st_geometry(species), pch = 20, cex = 0.5)In #1, we use our first function from sf, which is
st_as_sf, which is used to “convert foreign object to an sf
object” (?st_as_sf). In this case we coerce a
data.frame/tibble into an sf
POINT object, by specifying the x and y coordinates for the “coords”
argument. The summary of species, now an sf
object, displays metadata describing the number of features and
variables, the geometry type, the number of dimensions, the object’s
bounding box (bbox), its spatial reference system information (EPSG
string and proj4string; more on these later), which are currently not
set, and then it shows the tibble with values. x
and y no longer occur, and have been replaced by a geometry
column now.
Looking at str in #2, we see the object is constructed
of 4 classes (sf, tbl_df, tbl,
and data.frame), with the geometry variable being of class
sfc_POINT. Please refer back to the how
simple features are organized section on the sf website
to understand how the geometry is structured.
We then plot the new sf object, using
st_geometry to strip out the geometries of this object,
dropping the other variables in the tibble with associated
with the object. The reason for this is that plot
(sf’s generic for plotting sf objects) would
otherwise default to creating one plot panel for each variable in the
tibble. We can see this behavior here by comparison:
In the above, we illustrate this multi-panel plotting behavior using
just the first 10 rows of species, using dplyr
syntax to slice out those rows, in the process illustrating
that sf is designed to work with the
tidyverse.
So, in Chunk 3 above, we have seen how we can coerce the non-spatial
species into a spatial object sf. We can
coerce species back to an ordinary
data.frame/tibble by simply doing this:
# Chunk 5
species %>% as_tibble
#> # A tibble: 6,951 × 5
#> family species month year geometry
#> <chr> <chr> <dbl> <dbl> <POINT>
#> 1 Elephantidae Loxodonta africana 6 2012 (34.08678 -2.276964)
#> 2 Bovidae Syncerus caffer 6 2012 (35.75882 -3.5614)
#> 3 Bovidae Syncerus caffer 6 2012 (35.44965 -3.230564)
#> 4 Felidae Panthera pardus 6 2012 (34.96447 -2.475424)
#> 5 Elephantidae Loxodonta africana 6 2012 (35.13262 -1.883995)
#> 6 Bovidae Syncerus caffer 6 2012 (35.73392 -3.0035)
#> 7 Elephantidae Loxodonta africana 6 2012 (35.61116 -3.004654)
#> 8 Elephantidae Loxodonta africana 6 2012 (35.97219 -3.429279)
#> 9 Bovidae Syncerus caffer 6 2012 (35.12084 -1.948731)
#> 10 Elephantidae Loxodonta africana 6 2012 (35.03655 -1.983431)
#> # ℹ 6,941 more rowsThis gets us back to a tibble, but the x and
y variables do not reappear, instead we still have the geometry
column. If we want to get back to the original structure of the
tibble, its more complicated, but certainly possible:
# Chunk 6
bind_cols(species %>% st_drop_geometry(),
st_coordinates(species) %>% as_tibble()) %>%
select(family, species, month, year, X, Y) %>%
rename(x = X, y = Y)
#> # A tibble: 6,951 × 6
#> family species month year x y
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Elephantidae Loxodonta africana 6 2012 34.1 -2.28
#> 2 Bovidae Syncerus caffer 6 2012 35.8 -3.56
#> 3 Bovidae Syncerus caffer 6 2012 35.4 -3.23
#> 4 Felidae Panthera pardus 6 2012 35.0 -2.48
#> 5 Elephantidae Loxodonta africana 6 2012 35.1 -1.88
#> 6 Bovidae Syncerus caffer 6 2012 35.7 -3.00
#> 7 Elephantidae Loxodonta africana 6 2012 35.6 -3.00
#> 8 Elephantidae Loxodonta africana 6 2012 36.0 -3.43
#> 9 Bovidae Syncerus caffer 6 2012 35.1 -1.95
#> 10 Elephantidae Loxodonta africana 6 2012 35.0 -1.98
#> # ℹ 6,941 more rowsWe drop the geometry of species, and then
we use dplyr::bind_cols (equivalent to cbind)
to bind the coordinates back to the tibble. The coordinates
are obtained using st_coordinates, an sf
function used to “retrieve coordinates in matrix form”
(?st_coordinate), and we have to coerce that to a
tibble for bind_cols to work. We can use
select to reorder the columns in their original form, and
rename the coordinate columns back to their original lower case
(st_coordinates returns X and Y columns).
This is a standard way for geoscientists to convert spatial data back
and forth.
3.3 Read and write spatial data
That was a very quick dip into spatial waters, using a points
example. We are going to learn to read and write (in reverse order)
spatial data. We still have species back as an sf object.
We are now going to write it out to a spatial file.
# Chunk 6
st_write(obj = species, dsn = file.path(tempdir(), "species.shp"),
delete_dsn = TRUE)
#> writing: substituting ENGCRS["Undefined Cartesian SRS with unknown unit"] for missing CRS
#> Warning in CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), : GDAL Error 1:
#> /var/folders/8z/trk9m2ks7ksd2s963fkz3dg80000gq/T//RtmpXrss7r/species.shp does not appear to be a
#> file or directory.
#> Deleting source `/var/folders/8z/trk9m2ks7ksd2s963fkz3dg80000gq/T//RtmpXrss7r/species.shp' failed
#> Writing layer `species' to data source
#> `/var/folders/8z/trk9m2ks7ksd2s963fkz3dg80000gq/T//RtmpXrss7r/species.shp' using driver `ESRI Shapefile'
#> Writing 6951 features with 4 fields and geometry type Point.
dir(tempdir(), pattern = "species")
#> [1] "species.dbf" "species.prj" "species.shp" "species.shx"The code above uses st_write to create that old standby,
the ESRI shapefile, that clunky, annoying format that has many files for
one thing. Note that I have the option delete_dsn = TRUE
set, which is needed if the file already exists in that location, and
you want to recreate it (which I did in this case, because I was
re-running this code many times as I wrote this). For
st_write, the “obj” argument asks for the sf
object to write, “dsn” is the name and path for the output file. By
adding the “.shp” at the end of the filename, st_write
knows to use the “ESRI Shapefile” driver. For this example, I used again
the tempdir() function to write this object to a temporary
directory, but, as you follow along, you should change the
file path to correctly lead to your own
xyz320/notebooks/data path.
Looking in the output directory here, we see that there are only three files now (.dbf, .shp, and .shx). We are missing a .prj because of the following reason:
species does not have a coordinate reference system
(CRS) associated with it yet, even though we can tell it is in
geographic coordinates. Remember this, because we will come back to
it.
Because .shp is an annoying format, we can try other types. Let’s do the much more convenient “geojson” format, which produces a single file.
# Chunk 8
st_write(obj = species, dsn = file.path(tempdir(), "species.geojson"),
delete_dsn = TRUE)
#> writing: substituting ENGCRS["Undefined Cartesian SRS with unknown unit"] for missing CRS
#> Deleting source `/var/folders/8z/trk9m2ks7ksd2s963fkz3dg80000gq/T//RtmpXrss7r/species.geojson' failed
#> Writing layer `species' to data source
#> `/var/folders/8z/trk9m2ks7ksd2s963fkz3dg80000gq/T//RtmpXrss7r/species.geojson' using driver `GeoJSON'
#> Writing 6951 features with 4 fields and geometry type Point.
dir(tempdir(), pattern = "species")
#> [1] "species.dbf" "species.geojson" "species.prj" "species.shp" "species.shx"We see that we have just a single file to hold the .geojson, which is much nicer, and is an increasingly widely used file format, so you should use it whenever possible in place of .shp. However, we will stick with ESRI shapefiles for this class, since they are so widely used still. So let’s read back in from the shapefile itself. I am going the first delete the .geojson version, for the sake of directory cleanliness (having shown you a nicer format, I am not going to work with because so much is still done with
# Chunk 9
file.remove(dir(tempdir(), pattern = "species.geojson", full.names = TRUE))
#> [1] TRUE
rm(species)
species <- st_read(dir(tempdir(), pattern = "species.shp", full.names = TRUE))
#> Reading layer `species' from data source
#> `/private/var/folders/8z/trk9m2ks7ksd2s963fkz3dg80000gq/T/RtmpXrss7r/species.shp'
#> using driver `ESRI Shapefile'
#> Simple feature collection with 6951 features and 4 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: 29.91544 ymin: -11.09595 xmax: 38.79323 ymax: -1.44546
#> Projected CRS: Undefined Cartesian SRS with unknown unitWe used file.remove to get rid of the .sqlite, and then
used rm to get rid of the existing species sf
object, and then read back in the species we had written to
the .shp. So we have species back again.
Now, there are two more datasets that come with sdadata
to read in using st_read: roads.geojson and
districts.geojson. The paths to both are found with the
system.file function. I’ll set up the first one for
you.
# Chunk 10
fnm <- system.file("extdata/roads.geojson", package = "sdadata")
roads <- st_read(dsn = fnm) #%>% st_set_crs("ESRI:102022")
#> Reading layer `roads' from data source
#> `/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/roads.geojson'
#> using driver `GeoJSON'
#> Simple feature collection with 4569 features and 1 field
#> Geometry type: MULTILINESTRING
#> Dimension: XY
#> Bounding box: xmin: 478749.8 ymin: -1385619 xmax: 1591588 ymax: -118296.2
#> Projected CRS: Africa_Albers_Equal_Area_Conic
fnm <- system.file("extdata/districts.geojson", package = "sdadata")
districts <- st_read(dsn = fnm)
#> Reading layer `districts' from data source
#> `/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/districts.geojson'
#> using driver `GeoJSON'
#> Simple feature collection with 170 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS: WGS 84Let’s end this section by a quick look at these data. We are going to
plot districts, roads, and
species onto one plot.
# Chunk 11
par(mar = c(0, 0, 0, 0))
plot(st_geometry(districts), col = "grey")
plot(st_geometry(roads), col = "red", add = TRUE)
plot(st_geometry(species), pch = 20, reset = FALSE, col = "blue", add = TRUE)So what have we done? First, we use the par function
with the mar (margin) argument and a 4-element vector of 0s
to remove the inner margins of the plot so that the map of Tanzania
fills the plot frame. We then use plot to draw the map of
districts (wrapped in st_geometry, to just
grab the geometry part), which are administrative boundaries for
Tanzania. We then use lines to add roads on top of that,
making them red. Finally, we add the species data as blue
points. Note the use of add = TRUE, which is necessary to
overlay the two plots on the district boundaries without
creating a new plot of each.
That’s great, you say, but I see no roads in the map.
Where are they? And why are there two species outside of Tanzania?
We’ll look at that in our next section, but a hint of why can be seen
in the outputs from Chunk 10. You will note that in Chunks 9 and 10 that
st_read (the opposite of st_write) gives a
brief summary of the spatial data being read into R.
3.4 Coordinate Reference Systems and Projections
The reason you can’t see the roads on the previous map can be illustrated as follows:
# Chunk 12
sfdat <- list("districts" = districts, "roads" = roads, "species" = species)
sapply(sfdat, function(x) st_bbox(x))
#> districts roads species
#> xmin 29.589529 478749.8 29.91544
#> ymin -11.762349 -1385619.0 -11.09595
#> xmax 40.444735 1591587.5 38.79323
#> ymax -0.983143 -118296.2 -1.44546In the code above, we add our 3 sf objects into a
list (sfdat), and then use sapply
to apply st_bbox to each list element to get their bounding
boxes. That produces an output matrix that shows in the westernmost
(xmin), southernmost (ymin), easternmost (xmax), and northernmost (ymax)
coordinates of each object. This shows us that roads’
coordinates are in different units (meters) than districts’
and species’, which are decimal degrees. So when we tried
to plot roads on the same map as districts and
species, roads didn’t appear because the units
are not the same. You could get a map of roads if you
plotted them on their own.
3.4.1 Checking the CRS of
an sf object
In this case, we have to do a bit more, which is to convert
roads coordinate reference system (CRS) to decimal degrees.
How do we do that? First, we have to know the crs we want to transform
roads into, which we can do by checking the
st_crs of species or
districts:
# Chunk 13
st_crs(districts)$proj4string
#> [1] "+proj=longlat +datum=WGS84 +no_defs"
st_crs(roads)$proj4string
#> [1] "+proj=aea +lat_0=0 +lon_0=25 +lat_1=20 +lat_2=-23 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs"We use st_crs, which is sf’s function for
checking and assigning a CRS to an sf object. When you run
it as we have done with st_crs(districts), it returns an
object of class crs (you can verify that by running
class(st_crs(districts))). We see that
districts and roads have different values in
their “PROJCRS” fields, which holds the projection string that contains
the projection parameters of the CRS. If you want to understand the
meaning of projection string parameters, please have a look at the documentation for the proj.4
library, specifically the page on parameters.
This one basically tells us that districts has a
geographic (longlat) projection, and uses the WGS84 datum. Let’s check
the crs for all three objects now, using our old friend
sapply
# Chunk 14
sapply(sfdat, function(x) st_crs(x)$proj4string)
#> districts
#> "+proj=longlat +datum=WGS84 +no_defs"
#> roads
#> "+proj=aea +lat_0=0 +lon_0=25 +lat_1=20 +lat_2=-23 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs"
#> species
#> NAThese results show us that species does not have a
projection string. So how come it was able to plot on the same map as
districts? Because the coordinates are in the same units.
However, we want to fix that before we move to the next step,
reprojecting:
# Chunk 15
st_crs(species) <- st_crs(districts)
#> Warning: st_crs<- : replacing crs does not reproject data; use st_transform for that
st_crs(sfdat$species) <- st_crs(districts)
#> Warning: st_crs<- : replacing crs does not reproject data; use st_transform for that
st_crs(sfdat$species)$proj4string
#> [1] "+proj=longlat +datum=WGS84 +no_defs"Simple–we apply the fix to species and to
sfdat$species by assigning each the proj4string from
districts. We can do that because we are pretty sure that
the coordinates for species are from a geographic coordinate system that
uses the WGS84 datum. If that is incorrect, we are introducing some
spatial error. But for now, this is fine.
Another note, when running st_crs, we are extracting the
proj4string element from the resulting crs
object. That is because it provides a more compact string representation
of the CRS information. Running st_crs(sfdat$species)
produces a much lengthier (and more informative) output in
wkt format. Go ahead and try that.
3.4.2 Reprojecting
Okay, so now we want to change roads to the same CRS as
districts.
We use sf‘s st_transform function to
reproject sfdat$roads using the parameters supplied by
districts’ p4s, which is passed into the function’s “crs”
argument via the st_crs function applied to
districts.
And that’s all. We can now redo our plots. I am going to show you two
ways to do it. First, the way we have already done it with
sf:plot:
# Chunk 17
par(mar = c(0, 0, 0, 0))
plot(sfdat$districts %>% st_geometry(), col = "grey")
plot(sfdat$roads %>% st_geometry(), col = "red", add = TRUE)
plot(sfdat$species %>% st_geometry(), col = "blue", pch = 20, add = TRUE)And with ggplot:
# Chunk 18
ggplot(sfdat$districts) + geom_sf() +
geom_sf(data = sfdat$roads, col = "red") +
geom_sf(data = sfdat$species, col = "blue")So now you see the roads appearing as red lines on the map.
ggplot adds a coordinate grid to the outside of the plot,
and has a special function geom_sf that is designed to plot
sf geometries. However, it is still substantially slower
than sf::plot for the time being, so we’ll mostly stick
with the latter.
3.5 Making simple features from scratch
Before ending this section, it is worth looking at how we can make
our own sf objects from scratch, which we might need to do
from time to time.
# Chunk 19
# #1
pt <- st_point(x = c(34, -6))
pt
#> POINT (34 -6)
#
# #2
pts <- st_multipoint(x = cbind(x = c(35.5, 36, 36.5), y = c(-5, -5.5, -6)))
pts
#> MULTIPOINT ((35.5 -5), (36 -5.5), (36.5 -6))
#
# #3
sline <- st_linestring(cbind(x = c(35, 35.5, 36), y = c(-5.5, -6, -6.5)))
sline
#> LINESTRING (35 -5.5, 35.5 -6, 36 -6.5)
#
# #4
pol <- st_polygon(list(cbind(x = c(34.5, 35.5, 35, 34, 34.5),
y = c(-6, -7, -7.5, -6.5, -6))))
pol
#> POLYGON ((34.5 -6, 35.5 -7, 35 -7.5, 34 -6.5, 34.5 -6))
#
# 5
par(mar = c(0, 0, 0, 0))
plot(districts %>% st_geometry(), col = "grey")
plot(pt, add = TRUE, col = "red", pch = 20)
plot(pts, add = TRUE, col = "blue", pch = 20)
plot(sline, add = TRUE, col = "cyan", lwd = 2)
plot(pol, add = TRUE, col = "orange", lwd = 2)We use st_point in #1 to create a single point object,
based on a two-element vector that consists of the x and y coordinate.
In #2, st_multipoint create an object with three points,
based on an matrix containing the x coordinates in the first column and
y coordinates in second column. #3 convert a similar matrix of
coordinates into an st_linestring, and #4 creates a polygon
object, also using a matrix as input–note that the first and last
coordinates in both the x and y coordinates are the same, which is done
in order to close the polygon. The printout of each sfg
(simple feature geometry) shows that they store coordinates as point
pairs separated by commas (except for pt, since it is just
a single point). In #5 we plot each object over the
districts polygons.
We’ll go bit further in making new features in the section on spatial operations
3.6 Practice
3.6.1 Questions
What are the primary simple feature geometry types?
How do you convert a
tibble(ordata.frame) into ansfobject?When plotting (mapping)
sfobjects, what happens if your object has multiple variables?When constructing new
sfobjects, in what format do you have to provide inputs to thest_*constructor functions. How does the input object differ in the case of a POLYGON versus MULTIPOINT or LINESTRING?
3.6.2 Code
Working with the
speciessfobject, convert it to just a set of points without any other variables.Write out
specieswith it CRS set to the same asdistrictsinto an sqlite file within yournotebooks/datafolder. Remove thespeciesobject from your workspace, and then read back intoRthe sqlite you just wrote out.Examine the
classandstrofdistrictsandroads.Create a plot of
roads–just the geometries–and make the color of the lines blue.Create a plot of
districts, usingdplyr::selectto choose the distName variable from the objectsdata.frame. Set the title to “Tanzania Districts” using the “main” argument. Don’t specify a color.Create an
st_multipointobject using x coordinates ofc(35, 36, 37)and y coordinates ofc(-5, -6, -7), and plot that overdistrictsas orange points.
4 Spatial properties and attribute fields
We now begin to learn how to perform analyses with spatial vectors, starting with calculating basic spatial properties and manipulating the attribute tables associated with spatial vectors.
As you probably closed this project between Section 1 and 2, you will
also need to reload the roads, districts, and
species datasets (coercing the latter back to an
sf object). The code in Section 1
will show you how to do that, but here is some code to recreate most of
what we need:
# Chunk 20
roads <- system.file("extdata/roads.geojson", package = "sdadata") %>% st_read()
#> Reading layer `roads' from data source
#> `/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/roads.geojson'
#> using driver `GeoJSON'
#> Simple feature collection with 4569 features and 1 field
#> Geometry type: MULTILINESTRING
#> Dimension: XY
#> Bounding box: xmin: 478749.8 ymin: -1385619 xmax: 1591588 ymax: -118296.2
#> Projected CRS: Africa_Albers_Equal_Area_Conic
districts <- system.file("extdata/districts.geojson", package = "sdadata") %>%
st_read()
#> Reading layer `districts' from data source
#> `/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/districts.geojson'
#> using driver `GeoJSON'
#> Simple feature collection with 170 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS: WGS 84
species <- system.file("extdata/species.csv", package = "sdadata") %>%
read_csv() %>% st_as_sf(coords = c("x", "y"), crs = 4326)
#> Rows: 6951 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): family, species
#> dbl (4): month, year, x, y
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.This looks a bit different than what we did before. In this
construction, we make use of dplyr pipes to do everything
in one go for all three datasets. We pipe the file path to
roads.geojson straight to sf::st_read(), and
that reads in it as a one-liner. Same goes for districts
(despite the line break). species gets the file path piped
to read_csv, and then the coercion to sf
happens right downstream. We add the crs = 4326 argument to
st_as_sf to provide the missing projection information
(4326 is the EPSG identifier for GCS WGS84)
4.1 Spatial properties
First, due to recent changes in sf, and how it handles
spherical geometries, we have to set the following:
Otherwise if we try to calculate lengths or create buffers of vectors that are in geographic coordinates systems, we get odd results.
4.1.1 Area
Using polygons to calculate area is a fairly basic but important
spatial vector operation. To calculate meaningful areas, generally you
need your data to be projected into a coordinate system in which area is
one of the true properties. However, the function we are going to use,
sf::st_area, will estimate area even for data in geographic
coordinates.
# Chunk 21
# #1
dist_areas <- districts %>% st_area()
dist_areas
#
# #2
dist_areas %>% units::set_units("ha")
#
# #3
dist_areas %>% units::set_units("km^2")
#> Units: [m^2]
#> [1] 1237181424 266991880 3237796747 8064122142 1265947533 6981462233 15909001482
#> [8] 364969525 538906808 728705613 5630896694 9132303284 7289672419 2607595710
#> [15] 5792245985 3958699261 7455226936 8055578543 3074286510 4707386048 2280950383
#> [22] 1606578879 18784745856 369109423 9243791258 572527986 7514438395 7463660843
#> [29] 1617874960 83240902 5133730982 2506182100 2824869889 3502512919 3305058442
#> [36] 231069002 295025066 229584392 234454737 29625841684 16941638272 318438971
#> [43] 1506332304 2208746511 7195183068 911823145 13406516414 967687278 92676984
#> [50] 10054389973 902810020 1345846714 63391414 1967428180 1470230691 6220454054
#> [57] 1216778726 219107154 256032980 505758018 379464273 15000069365 5974912721
#> [64] 1063850901 34313758593 5973961212 2515594227 5569089606 470998471 3674639527
#> [71] 13131616674 3799582883 19817537139 3092906601 2167876971 1069975601 64836217
#> [78] 2001857930 11157052367 1534469486 17506256581 755895744 14439000656 2812972831
#> [85] 252548054 2155115205 217763048 15456587 1971609350 13545763762 11774234334
#> [92] 12458010586 288349054 6631589820 23461563741 4005226578 753345006 3692082480
#> [99] 169900954 5203426450 1952675671 2048455953 254659437 3254313104 1517275875
#> [106] 2579364563 182725205 3152995098 628109329 5952783983 883468899 3996387124
#> [113] 3454262122 3536074287 3437196007 8455502522 1499486177 705286644 5031222977
#> [120] 642661933 2827362961 12747274544 5569504313 9824247730 5150341159 1373838352
#> [127] 4839969595 21765250729 3062559801 14045064690 578294562 18413517705 6754668703
#> [134] 1363202971 4182361771 3568623180 554679096 5484536065 1327969684 4689877393
#> [141] 3666986218 9018397768 8860767776 4520782523 28933748592 3120855417 2399966691
#> [148] 720561356 1880346226 3857652254 4708755744 11164654665 87466772 7064482800
#> [155] 15336076188 7864284875 26279536320 1460815874 5415791391 11967050744 6529182188
#> [162] 837413267 6433087358 3203142195 225328280 4091625158 2711929790 1497926427
#> [169] 1763138596 596500408
#> Units: [ha]
#> [1] 123718.142 26699.188 323779.675 806412.214 126594.753 698146.223 1590900.148
#> [8] 36496.952 53890.681 72870.561 563089.669 913230.328 728967.242 260759.571
#> [15] 579224.599 395869.926 745522.694 805557.854 307428.651 470738.605 228095.038
#> [22] 160657.888 1878474.586 36910.942 924379.126 57252.799 751443.839 746366.084
#> [29] 161787.496 8324.090 513373.098 250618.210 282486.989 350251.292 330505.844
#> [36] 23106.900 29502.507 22958.439 23445.474 2962584.168 1694163.827 31843.897
#> [43] 150633.230 220874.651 719518.307 91182.315 1340651.641 96768.728 9267.698
#> [50] 1005438.997 90281.002 134584.671 6339.141 196742.818 147023.069 622045.405
#> [57] 121677.873 21910.715 25603.298 50575.802 37946.427 1500006.937 597491.272
#> [64] 106385.090 3431375.859 597396.121 251559.423 556908.961 47099.847 367463.953
#> [71] 1313161.667 379958.288 1981753.714 309290.660 216787.697 106997.560 6483.622
#> [78] 200185.793 1115705.237 153446.949 1750625.658 75589.574 1443900.066 281297.283
#> [85] 25254.805 215511.521 21776.305 1545.659 197160.935 1354576.376 1177423.433
#> [92] 1245801.059 28834.905 663158.982 2346156.374 400522.658 75334.501 369208.248
#> [99] 16990.095 520342.645 195267.567 204845.595 25465.944 325431.310 151727.588
#> [106] 257936.456 18272.520 315299.510 62810.933 595278.398 88346.890 399638.712
#> [113] 345426.212 353607.429 343719.601 845550.252 149948.618 70528.664 503122.298
#> [120] 64266.193 282736.296 1274727.454 556950.431 982424.773 515034.116 137383.835
#> [127] 483996.960 2176525.073 306255.980 1404506.469 57829.456 1841351.771 675466.870
#> [134] 136320.297 418236.177 356862.318 55467.910 548453.606 132796.968 468987.739
#> [141] 366698.622 901839.777 886076.778 452078.252 2893374.859 312085.542 239996.669
#> [148] 72056.136 188034.623 385765.225 470875.574 1116465.466 8746.677 706448.280
#> [155] 1533607.619 786428.488 2627953.632 146081.587 541579.139 1196705.074 652918.219
#> [162] 83741.327 643308.736 320314.220 22532.828 409162.516 271192.979 149792.643
#> [169] 176313.860 59650.041
#> Units: [km^2]
#> [1] 1237.18142 266.99188 3237.79675 8064.12214 1265.94753 6981.46223 15909.00148
#> [8] 364.96952 538.90681 728.70561 5630.89669 9132.30328 7289.67242 2607.59571
#> [15] 5792.24599 3958.69926 7455.22694 8055.57854 3074.28651 4707.38605 2280.95038
#> [22] 1606.57888 18784.74586 369.10942 9243.79126 572.52799 7514.43839 7463.66084
#> [29] 1617.87496 83.24090 5133.73098 2506.18210 2824.86989 3502.51292 3305.05844
#> [36] 231.06900 295.02507 229.58439 234.45474 29625.84168 16941.63827 318.43897
#> [43] 1506.33230 2208.74651 7195.18307 911.82315 13406.51641 967.68728 92.67698
#> [50] 10054.38997 902.81002 1345.84671 63.39141 1967.42818 1470.23069 6220.45405
#> [57] 1216.77873 219.10715 256.03298 505.75802 379.46427 15000.06937 5974.91272
#> [64] 1063.85090 34313.75859 5973.96121 2515.59423 5569.08961 470.99847 3674.63953
#> [71] 13131.61667 3799.58288 19817.53714 3092.90660 2167.87697 1069.97560 64.83622
#> [78] 2001.85793 11157.05237 1534.46949 17506.25658 755.89574 14439.00066 2812.97283
#> [85] 252.54805 2155.11521 217.76305 15.45659 1971.60935 13545.76376 11774.23433
#> [92] 12458.01059 288.34905 6631.58982 23461.56374 4005.22658 753.34501 3692.08248
#> [99] 169.90095 5203.42645 1952.67567 2048.45595 254.65944 3254.31310 1517.27588
#> [106] 2579.36456 182.72520 3152.99510 628.10933 5952.78398 883.46890 3996.38712
#> [113] 3454.26212 3536.07429 3437.19601 8455.50252 1499.48618 705.28664 5031.22298
#> [120] 642.66193 2827.36296 12747.27454 5569.50431 9824.24773 5150.34116 1373.83835
#> [127] 4839.96960 21765.25073 3062.55980 14045.06469 578.29456 18413.51771 6754.66870
#> [134] 1363.20297 4182.36177 3568.62318 554.67910 5484.53606 1327.96968 4689.87739
#> [141] 3666.98622 9018.39777 8860.76778 4520.78252 28933.74859 3120.85542 2399.96669
#> [148] 720.56136 1880.34623 3857.65225 4708.75574 11164.65466 87.46677 7064.48280
#> [155] 15336.07619 7864.28488 26279.53632 1460.81587 5415.79139 11967.05074 6529.18219
#> [162] 837.41327 6433.08736 3203.14220 225.32828 4091.62516 2711.92979 1497.92643
#> [169] 1763.13860 596.50041Very simple to calculate area for each polygon in
districts. Note that the default is to calculate m\(^2\), but we can change that to hectares
(in #2) using or km\(^2\) (#3) using
the function units::set_units (the units
package is imported by sf). Note that only numbers with
units can be operated together (e.g. >|+...). This is
sometimes annoying, but safe.
So this was an estimate of area from a non-project dataset. How does
that compare to area from a true equal area projection, such as the
Albers Equal Area projection that roads is projected
in:
# Chunk 21
dist_areas_alb <- districts %>%
st_transform(crs = st_crs(roads)) %>%
st_area()
absd <- mean(abs(dist_areas_alb - dist_areas)) %>% units::set_units("ha")
pctd <- as.numeric(absd / (mean(dist_areas_alb) / 10000)) * 100
absd
#> 0.1048738 [ha]
pctd
#> [1] 2.017064e-05Using st_transform with the CRS from roads
to reproject district, and then calculate polygons areas,
we see the mean absolute difference between areas calculated from the
projected and unprojected datasets is around 0 ha, which is a tiny
percentage (0%) of the average district area. Not much to write home
about, but also not nothing if absolute accuracy of area calculations is
your interest.
# Chunk 22
districts %>%
st_area() %>% units::set_units("km^2") %>%
as.numeric() %>%
tibble(km2 = .) %>%
ggplot() +
geom_histogram(aes(x = km2), bins = 15, fill = "blue", col = "grey") +
xlab(bquote("km"^2))Here’s a look at the district areas (in km\(^2\)) as a histogram. Notice the conversion
steps. First we wanted to convert the calculated areas to a numeric
vector (rather than one of class units), using
as.numeric, then we made it into a tibble,
because ggplot only works with data.frame, and
then we plotted the histogram. There is one new addition there–in
xlab, we use bquote to print a superscript 2
in the km\(^2\) axis label.
4.1.2 Distance
We can also calculate the distances between spatial features. To do
that, we are make use of st_centroid on the
Albers-projected districts data, which we use to find the
central coordinates of each district, and then plot the centroids on top
of the districts themselves.
# Chunk 23
# # 1
dist_cent <- districts %>%
st_transform(., st_crs(roads)) %>%
st_centroid()
#> Warning: st_centroid assumes attributes are constant over geometries
# dist_cent <- st_centroid(st_transform(districts, st_crs(roads))) # equivalent
# #2
par(mar = c(0, 0, 0, 0))
districts %>%
st_transform(., st_crs(roads)) %>%
st_geometry() %>%
plot(col = "grey")
dist_cent %>%
st_geometry() %>%
plot(col = "red", pch = 20, add = TRUE)Notice that right below the centroid calculation pipeline (#1) is a
commented out line, which is the non-pipeline equivalent of the
operation. In #2, we make the plots, setting up the plotting functions
for both the districts polygons and their centroids on the
downstream ends of pipelines.
Moving on, we can now find out how far apart the centroids of each
district are by using another sf function,
st_distance
# Chunk 24
# #1
all_dists <- st_distance(x = dist_cent)
all_dists[1:5, 1:5]
#> Units: [m]
#> 1 2 3 4 5
#> 1 0.00 10773.52 131963.4 76784.10 22154.92
#> 2 10773.52 0.00 129079.6 86863.41 26309.87
#> 3 131963.35 129079.64 0.0 147658.87 153955.95
#> 4 76784.10 86863.41 147658.9 0.00 84144.27
#> 5 22154.92 26309.87 153956.0 84144.27 0.00
#
# #2
d1_dists <- st_distance(x = dist_cent[1, ], y = dist_cent[-1, ])
d1_dists %>% as.vector()
#> [1] 10773.52 131963.35 76784.10 22154.92 48224.76 147078.03 495663.62 474665.43 514089.82
#> [10] 353886.60 362278.55 250294.34 340275.66 184677.86 316956.16 406549.64 527641.50 521549.31
#> [19] 470564.45 468628.64 423168.10 516170.47 529745.02 529970.53 612782.56 628895.99 566686.88
#> [28] 575019.66 560010.92 609614.35 659079.91 605944.44 562199.71 624234.34 373298.61 380520.11
#> [37] 402797.82 410925.40 640600.59 700882.58 692258.56 709156.11 599117.31 662530.06 697844.47
#> [46] 598375.94 736838.39 749211.47 714257.32 53531.44 76012.40 67721.81 108887.05 88347.56
#> [55] 162428.80 52353.39 390902.41 394763.98 434482.54 458452.81 723316.19 841413.35 839203.18
#> [64] 719158.31 849623.83 827749.66 124030.35 140713.62 200709.66 223305.43 161159.43 122529.49
#> [73] 327921.31 343648.81 366292.29 369469.86 363178.41 257102.36 326563.80 618376.01 787743.39
#> [82] 638355.70 743121.93 736326.31 762474.50 427043.00 424225.69 340853.12 598637.28 431458.14
#> [91] 471510.98 419398.57 386654.23 675532.75 910230.56 901426.95 910836.04 898345.07 925203.27
#> [100] 911874.62 920602.72 397994.98 353407.25 351071.53 383560.81 397869.36 440686.55 412550.04
#> [109] 812784.29 681952.75 743611.48 712947.98 753428.93 709107.31 394265.91 454764.07 458753.23
#> [118] 501837.23 622868.53 530822.22 580031.95 809714.84 770461.52 758146.10 759308.41 895288.77
#> [127] 862583.81 948485.07 845917.06 870659.86 901352.65 450843.31 427350.51 305811.30 373118.50
#> [136] 337148.28 262813.88 321876.85 237240.89 300375.88 229361.05 308646.42 279916.68 430791.06
#> [145] 227841.42 240168.50 269285.41 797281.67 770562.91 784042.08 667158.50 811102.94 332184.61
#> [154] 546569.30 399846.38 489949.70 447580.36 523652.09 408873.43 311721.24 285328.88 272796.22
#> [163] 251161.26 284139.82 230940.61 285163.28 305912.46 346104.37 319510.29
#
# #3
d1_dists %>%
as.vector() %>%
quantile() / 1000
#> 0% 25% 50% 75% 100%
#> 10.77352 321.87685 450.84331 692.25856 948.48507There are two variants above. The first (#1) is captured in
all_dists, and it produces a matrix that provides the
distance, in meters, between each district centroid and every other
district centroid (we show the first 5 rows and 5 columns of the matrix,
otherwise it is 72X72). To get that, you simply pass in the
dist_cent object into to the “x” argument. The second
variant (#2) (d1_dists), uses both “x” and “y” arguments,
and it simply finds the distance between the first district centroid and
all the other districts’ centroids. We coerce the matrix to a vector for
more compact printing. In #3, we summarize d1_dists using
the quantile function, converting meters to kilometers.
4.1.3 Length
For the last example of functions that extract spatial properties, we will calculate the line lengths and perimeters.
# Chunk 25
# #1
## WARNING: when use as.numeric(), keep in mind the original unit
road_length <- st_length(roads) %>% as.numeric()
#
# #2
dist_perims <- st_length(districts) %>% as.numeric()
#
# #3
p1 <- ggplot(tibble(x = road_length), aes(x / 1000)) +
geom_histogram(fill = "blue", col = "grey", bins = 15) + xlab("km") +
ggtitle("Tanzania road segment lengths")
p2 <- ggplot(tibble(x = dist_perims), aes(x / 1000)) +
geom_histogram(fill = "purple", col = "grey", bins = 10) + xlab("km") +
ylab("") + ggtitle("Tanzania district perimeters")
cowplot::plot_grid(p1, p2)We use st_length to calculate the length of the line
segments in roads (#1) and the perimeter length of the
polygons in district (#2), converting both to numeric
vectors. We then plot each set of lengths as histograms (#3), converting
units to kilometers (by dividing by 1000, within
geom_histogram), and then use
cowplot::plot_grid to arrange the two histograms
side-by-side.
4.2 Manipulating attributes
sf objects stores attributes as data.frames
or tibbles, and they can be manipulated using the same
preparatory and analytical methods we learned about in the previous
modules. For example, let’s consider some very basic subsetting:
# Chunk 26
par(mar = c(0, 0, 0, 0))
st_geometry(districts) %>% plot(col = "grey")
districts %>%
slice(1, 20, 40) %>%
plot(col = "blue", add = TRUE)
set.seed(123)
species %>%
sample_n(20) %>%
st_geometry() %>%
plot(col = "yellow", pch = "+", add = TRUE)Notice the use of dplyr verbs to subset two datasets on
the fly before they are piped into plot. The plot as setup
with all districts as a backdrop (using
st_geometry to strip away the attribute table), and then we
overlay on that in blue the 1st, 20th, and 40th district,
sliced out of districts. On top of that, we
take a random sample of 20 observations from species, and
plot those on top. You can also index into the sf objects
with []:
# Chunk 27
par(mar = c(0, 0, 0, 0))
plot(st_geometry(districts), col = "grey")
plot(districts[c(1, 20, 40), ], col = "blue", add = TRUE)
set.seed(123)
plot(st_geometry(species[sample(1:nrow(species), 20), ]), col = "yellow",
pch = "+", add = TRUE)The code above makes the identical plot as in Chunk 26, so we don’t display that again, but note the indexing, as well as the lack of piping and more conventional syntax.
4.2.1 Mutating and joining
The Spatial Properties showed how
to calculate area, length, and other properties as separate outputs. Why
not add them stick them in separate data objects. Now let’s see how to
add those properties back the sf attribute
tibble.
# Chunk 27
## WARNING: when use as.numeric(), keep in mind the original unit
districts %>% mutate(area = as.numeric(st_area(.) / 10^6))
#> Simple feature collection with 170 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS: WGS 84
#> First 10 features:
#> distName geometry area
#> 1 Arusha MULTIPOLYGON (((36.82231 -3... 1237.1814
#> 2 Arusha Urban MULTIPOLYGON (((36.63247 -3... 266.9919
#> 3 Karatu MULTIPOLYGON (((35.89154 -3... 3237.7967
#> 4 Longido MULTIPOLYGON (((36.35742 -3... 8064.1221
#> 5 Meru MULTIPOLYGON (((36.89951 -3... 1265.9475
#> 6 Monduli MULTIPOLYGON (((36.54092 -3... 6981.4622
#> 7 Ngorongoro MULTIPOLYGON (((35.32059 -1... 15909.0015
#> 8 Ilala MULTIPOLYGON (((39.21843 -6... 364.9695
#> 9 Kinondoni MULTIPOLYGON (((39.27007 -6... 538.9068
#> 10 Temeke MULTIPOLYGON (((39.3143 -6.... 728.7056So that’s a fairly straightforward way to do it. Just use
mutate with st_area (note that
st_area needs the . in it to function
correctly). In making a new area column, we divide by 10\(^6\) to convert to km\(^2\), and coerce the result to numeric.
Imagine that someone gave you a separate dataset that simply listed district names and their areas for Tanzania, but provided no other spatial information. Say they did something like this:
# Chunk 28
dist_area <- tibble(distName = districts$distName,
area = as.numeric(st_area(districts)) / 10^6)We could use a *_join to merge this onto
districts:
# Chunk 29
left_join(districts, dist_area, by = "distName")
#> Simple feature collection with 170 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS: WGS 84
#> First 10 features:
#> distName area geometry
#> 1 Arusha 1237.1814 MULTIPOLYGON (((36.82231 -3...
#> 2 Arusha Urban 266.9919 MULTIPOLYGON (((36.63247 -3...
#> 3 Karatu 3237.7967 MULTIPOLYGON (((35.89154 -3...
#> 4 Longido 8064.1221 MULTIPOLYGON (((36.35742 -3...
#> 5 Meru 1265.9475 MULTIPOLYGON (((36.89951 -3...
#> 6 Monduli 6981.4622 MULTIPOLYGON (((36.54092 -3...
#> 7 Ngorongoro 15909.0015 MULTIPOLYGON (((35.32059 -1...
#> 8 Ilala 364.9695 MULTIPOLYGON (((39.21843 -6...
#> 9 Kinondoni 538.9068 MULTIPOLYGON (((39.27007 -6...
#> 10 Temeke 728.7056 MULTIPOLYGON (((39.3143 -6....In this case a left_join works because the number of
records is identical and each row in each dataset has a unique district
name.
4.2.2 More subsetting
We have already done a bit of this a couple of sections back, but let’s use this opportunity to do a bit more subsetting, including a previously unseen approach:
# Chunk 30
# #1
kanyi_dists <- districts %>% filter(grepl("^Ka|^Nyi", distName))
#
# #2
dists_gt2m <- districts %>% filter(as.numeric(st_area(.) / 10^6) > 20000)
par(mar = c(0, 0, 0, 0))
plot(st_geometry(districts), col = "grey")
plot(st_geometry(kanyi_dists), col = rgb(0, 0, 1, alpha = 0.5), add = TRUE)
plot(st_geometry(dists_gt2m), col = rgb(1, 0, 0, alpha = 0.5), add = TRUE)That’s two examples of subsetting that that are accomplished with
filter. #1 is the new one, as it uses grepl to
select districts with names beginning with either “Ka” or “Nyi”.
grepl is one of several functions that make use of regular
expressions to identify and do things (select, substitute, split, etc)
to string variables. You can read up more on the subject by starting
with ?grepl and ?regex. Regular expressions
are quite powerful, and although they can be somewhat arcane, they are
well worth learning (although we don’t do much with them here). #2
creates dists_gt2m) by calculating area within
filter on the fly, and using that to select districts
>20000 km\(^2\). The resulting
subsets are plotted onto the full district map of Tanzania, using
another function appearing for the first time in geospaar,
rgb, which we use to define the colors for filling the
polygons. rgb makes red-green-blue color combinations, and
the alpha argument allows us to define how transparent the colors are.
In this case, our two subsets selected some of the same polygons (two of
the districts with names beginning with “Ka” or “Ny” are also larger
than 20,000 km\(^2\)), thus the partial
transparency is needed to show where the overlaps occur.
4.2.3 Split-apply-combine
As with plain old data.frames, we might want to do
split-apply-combine operations on spatial data. Let’s demonstrate this
selecting the districts in districts_alb2 that fall within
certain area ranges, convert the selected districts to centroid points,
and then plot them.
# Chunk 30
# #1
tertiles <- function(x) quantile(x, probs = c(0, 0.333, 0.667, 1))
dist_tertiles <- districts %>%
mutate(area = as.numeric(st_area(.)) / 10^6) %>%
mutate(acls = cut(area, breaks = tertiles(area), include.lowest = TRUE)) %>%
group_by(acls) %>%
summarize(mean_area = mean(area))
#> although coordinates are longitude/latitude, st_union assumes that they are planar
dist_tertiles
#> Simple feature collection with 3 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS: WGS 84
#> # A tibble: 3 × 3
#> acls mean_area geometry
#> <fct> <dbl> <MULTIPOLYGON [°]>
#> 1 [15.5,1.51e+03] 651. (((32.75855 -9.249037, 32.76249 -9.252446, 32.76649 -9.254365, 32.7…
#> 2 (1.51e+03,5.19e+03] 3146. (((31.4864 -7.460071, 31.4871 -7.461485, 31.48803 -7.465252, 31.487…
#> 3 (5.19e+03,3.43e+04] 11765. (((35.86454 -11.41009, 35.86409 -11.41199, 35.8631 -11.41347, 35.86…
#
# #2
par(mar = rep(0, 4))
plot(st_geometry(dist_tertiles), col = c("red", "purple", "blue"))
legend(x = "bottomleft", pch = 15, col = c("red", "purple", "blue"),
bty = "n", legend = c("Bottom third", "Middle third", "Largest third"))The above is a dplyr-based split-apply-combine, which
has a fair bit going on, so let’s explain:
- We first create a new function called
tertiles, which will divide any vectorxit is given into thirds. - In the pipeline, the first line does the usual creation of an
area variable. The second line use the
cutfunction (see?cut) to classify the area variable into thirds (i.e. districts whose areas are below the 33rd percentile area, those between the 33rd and 66th percentile areas, and those above the 66th percentile), using ourtertilefunction, resulting in a newaclsvariable. - The third line in the pipeline groups the data by
acls, and then calculates the mean area of each group. Thissummarizeis spatial, in that it aggregates (dissolves) the polygons that are in group into a single large polygon for each group, while creating an output value describing the average area of the districts that were unioned to make each group. This is achieved by a genericsummarizemethod defined forsf(see?summarise.sf). - The
plotshows the result of the union (plus a legend that we add)
We can do the same thing for standard deviation, but using a quintile grouping:
# Chunk 31
quintiles <- function(x) quantile(x, probs = seq(0, 1, 0.2))
dist_quintiles <- districts %>%
mutate(area = as.numeric(st_area(.)) / 10^6) %>%
mutate(acls = cut(area, breaks = quintiles(area), include.lowest = TRUE)) %>%
group_by(acls) %>%
summarize(sd_area = sd(area))
#> although coordinates are longitude/latitude, st_union assumes that they are planar
dist_quintiles
#> Simple feature collection with 5 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS: WGS 84
#> # A tibble: 5 × 3
#> acls sd_area geometry
#> <fct> <dbl> <MULTIPOLYGON [°]>
#> 1 [15.5,718] 192. (((32.75855 -9.249037, 32.76249 -9.252446, 32.76649 -9.254365, 32.7674…
#> 2 (718,2.11e+03] 408. (((33.44827 -9.187084, 33.44973 -9.191858, 33.45134 -9.195909, 33.4525…
#> 3 (2.11e+03,4e+03] 546. (((31.01872 -2.843939, 31.01967 -2.84478, 31.01979 -2.846578, 31.02241…
#> 4 (4e+03,7.9e+03] 1097. (((32.44603 -8.245552, 32.44556 -8.246121, 32.44575 -8.24695, 32.44642…
#> 5 (7.9e+03,3.43e+04] 6703. (((35.86124 -11.41664, 35.85951 -11.41725, 35.85778 -11.41777, 35.8556…
cols <- heat.colors(5)
par(mar = rep(0, 4))
plot(st_geometry(dist_quintiles), col = cols)
legend(x = "bottomleft", pch = 15, col = cols, bty = "n",
legend = paste0(1:5, c("st", "nd", "rd", "th", "th"), " quantile"))Our plot in this case uses the heat.colors() function
for filling the polygons.
That was a quick introduction to SAC for sf. There are
others we could do (you can visit the tidyverse
examples page for sf to see some others), but we will
move on to the learn about other spatial operations now.
4.3 Practice
4.3.1 Questions
How different are areas calculated using projected and unprojected (geographic) coordinate systems when using
st_area? How comest_area“knows” how to calculate an area in m\(^2\) when the CRS is unprojected (the answer isn’t in the material above, but can be found if you read through?st_area)?You will note that we haven’t used
ggplot() + geom_sf()in this section. Why is that?How can we create a grouping variable that we can use to split an
sfobject according to different levels of spatial properties (e.g. area)?
4.3.2 Code
Using a seed of 1, randomly select (
sample_n) 10 districts fromdistrictsand calculate their average area in hectares.Using a seed of 1, randomly select (
sample_n) 100 road segments fromroadsand calculate their average length in kilometers.Plot districts with a “lightgrey” background, and then randomly select (seed of 1) 200 observations from species of Syncerus caffer, and plot them as red circles over
districts.From
roadselect the roads that have lengths between 50 and 100 km and plot those in red over districts. Note thatdistrictsandroadsare in different projections, so transformdistrictto have the same CRS asroadsfirst.Hack the code in Chunk 31 to create a map of districts merged by their deciles of area. Do the summary by total area. To do this, the
quantilefunction will have to be altered (passseq(0, 1, 0.1)to “probs”), and make sure your color vector has enough values.
5 Spatial Operations
In this section we will focus on some of the more common polygon-based spatial operations. To enable this, we are first going to create some additional spatial data.
# Chunk 32
# #1
coords <- cbind("x" = c(34.2, 34.3, 35.6, 35.6, 35.6, 34.2),
"y" = c(-4.8, -3.8, -3.9, -4.1, -4.9, -4.8))
#
# #2
pol <- st_polygon(x = list(coords)) %>%
st_sfc %>%
st_sf(ID = 1, crs = 4326)
pol
#> Simple feature collection with 1 feature and 1 field
#> Geometry type: POLYGON
#> Dimension: XY
#> Bounding box: xmin: 34.2 ymin: -4.9 xmax: 35.6 ymax: -3.8
#> Geodetic CRS: WGS 84
#> ID .
#> 1 1 POLYGON ((34.2 -4.8, 34.3 -...
par(mar = rep(0, 4))
plot(st_geometry(districts), col = "grey")
plot(pol, col = "purple", add = TRUE)The above takes coords, a matrix of x and y
coordinates (#1), and repeats the approach we used in Chunk 19 (Section
3.6) to create an st_polygon (#2), but then continues on to
convert that to a simple feature geometry list column
(st_sfc), then to create an sf, a
“data.frame-like object…with a simple feature list column”
(?st_sf), including the assignment of a CRS. That’s how you
build up a full simple feature with attributes.
5.1 Overlays/spatial queries
The first thing we will do with our made-up polygon is figure out which districts it intersects. What we are going to do in this case is the ArcGIS equivalent of select by spatial location (if I remember my ESRI terminology correctly–it’s been a while since I looked at it)
# Chunk 32
# #1
st_intersects(x = pol, y = districts)
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar
#> Sparse geometry binary predicate list of length 1, where the predicate was `intersects'
#> 1: 3, 13, 15, 68, 70, 72, 142, 143, 144, 146, ...
#
# #2
dists_int <- districts %>% slice(st_intersects(x = pol, y = districts)[[1]])
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar
dists_int
#> Simple feature collection with 12 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 33.91347 ymin: -5.683134 xmax: 36.34537 ymax: -2.928766
#> Geodetic CRS: WGS 84
#> First 10 features:
#> distName geometry
#> 1 Karatu MULTIPOLYGON (((35.89154 -3...
#> 2 Chemba MULTIPOLYGON (((35.35449 -4...
#> 3 Kondoa MULTIPOLYGON (((35.65407 -4...
#> 4 Babati MULTIPOLYGON (((35.72509 -3...
#> 5 Hanang MULTIPOLYGON (((34.92725 -4...
#> 6 Mbulu MULTIPOLYGON (((35.36185 -3...
#> 7 Meatu MULTIPOLYGON (((34.09986 -3...
#> 8 Ikungi MULTIPOLYGON (((34.59843 -4...
#> 9 Iramba MULTIPOLYGON (((34.50044 -3...
#> 10 Mkalama MULTIPOLYGON (((34.92725 -4...
#
# #3
par(mar = rep(0, 4))
plot(st_geometry(districts), col = "grey")
plot(st_geometry(dists_int), col = "blue", add = TRUE)
plot(st_geometry(pol), border = "purple", add = TRUE)In #1 we use st_intersects to identify the indices of
the districts that intersect pol. So that
doesn’t give us back the actual intersecting district polygons. To get
those, we have to do a bit more. The output from
st_intersects is a list object, which contains the index of
the polygon in x (the “1:” in the output) followed by the polygons in
the object passed to y that it intersects (15, 22, 26, 53, 54). We can
simplify that results to a vector by specifying the list index we want
(1): st_intersects(x = pol, y = districts)[[1]]. We plug
that into slice in #2 to pull out the polygons from
districts into a new object dists_int, and
then #3 illustrates the intersected districts in blue.
5.2 Intersecting
We have just seen how st_intersects can be used to
identify and extract intersecting polygons. Now let’s look at what
st_intersection does.
# Chunk 33
pol_int_dists <- st_intersection(x = pol, y = districts)
#> although coordinates are longitude/latitude, st_intersection assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
par(mar = rep(0, 4))
plot(st_geometry(districts), col = "grey")
plot(st_geometry(pol_int_dists), col = rainbow(n = nrow(pol_int_dists)),
add = TRUE)Seems pretty clear what st_intersection does, and how it
differs from st_intersects: it crops the intersected
districts to the boundaries of pol, and returns the
intersected district polygons. It also spits out a warning about the
attribute variable being spatially constant, as well as the usual
warning about the data not being in planar coordinates. We also
introduced another color function (rainbow) to distinguish
the different intersected districts.
5.3 Aggregating/Dissolving
Now that we have intersected boundaries, let’s dissolve them. We’ve actually already seen this (in section 4.2.3), but let’s have a second look:
# Chunk 34
# #1
pol_int_dists %>% summarize
#> although coordinates are longitude/latitude, st_union assumes that they are planar
#> Simple feature collection with 1 feature and 0 fields
#> Geometry type: POLYGON
#> Dimension: XY
#> Bounding box: xmin: 34.2 ymin: -4.9 xmax: 35.6 ymax: -3.8
#> Geodetic CRS: WGS 84
#> .
#> 1 POLYGON ((34.89974 -4.84998...
#
# #2
pol_int_muarea <- pol_int_dists %>%
mutate(area = as.numeric(units::set_units(st_area(.), "km^2"))) %>%
summarize(area = mean(area))
#> although coordinates are longitude/latitude, st_union assumes that they are planar
pol_int_muarea
#> Simple feature collection with 1 feature and 1 field
#> Geometry type: POLYGON
#> Dimension: XY
#> Bounding box: xmin: 34.2 ymin: -4.9 xmax: 35.6 ymax: -3.8
#> Geodetic CRS: WGS 84
#> area .
#> 1 1385.954 POLYGON ((34.89974 -4.84998...
par(mfrow = c(1, 2))
pol_int_dists %>%
st_geometry() %>%
plot(col = rainbow(n = nrow(pol_int_dists)), main = "Intersections")
pol_int_muarea %>%
st_geometry() %>%
plot(col = "blue", main = "Dissolved intersections")sf’s generic for summarize does the job for
us. In #1, we run summarize without specifying a variable
to summarize, and it returns a field-less (attribute-less) single
polygon (which is identical to pol, but not mapped). In #2,
we add a variable that is meaningful to summarize, which is the area of
the intersected districts. Our summary is the average area of those
intersected districts.
For a GIS person, it might be confusing to run a function called
summarize when you specifically want to do a dissolve operation.
Shouldn’t there be a function named that, or something close to it?
Well, there is one: st_union, which is actually called by
summarise.sf (which is called by the generic
summarize or summarise in the presence of an
sf object) by default. We can see its behavior here:
Same result, although in this case it produces a geometry set, and,
unlike summarize, st_union does not retain
information about the data values associated with the polygons.
5.4 Casting
Sometimes a spatial operation will result in mixed geometry types
within a single sf object, or a more complex geometry type
than you actually need. You might need to make all the geometries in the
object uniform (to prevent problems with other functions that don’t want
mixed geometry types), or simplify. Doing this is the job of
st_cast. Let’s look again at pol_int_dists
# Chunk 36
pol_int_dists
#> Simple feature collection with 12 features and 2 fields
#> Geometry type: GEOMETRY
#> Dimension: XY
#> Bounding box: xmin: 34.2 ymin: -4.9 xmax: 35.6 ymax: -3.8
#> Geodetic CRS: WGS 84
#> First 10 features:
#> ID distName .
#> 1 1 Karatu POLYGON ((35.04069 -3.85697...
#> 1.1 1 Chemba POLYGON ((35.33888 -4.88134...
#> 1.2 1 Kondoa POLYGON ((35.6 -4.9, 35.433...
#> 1.3 1 Babati POLYGON ((35.6 -4.1, 35.6 -...
#> 1.4 1 Hanang MULTIPOLYGON (((35.6 -4.696...
#> 1.5 1 Mbulu POLYGON ((35.6 -3.9, 35.6 -...
#> 1.6 1 Meatu POLYGON ((34.3 -3.8, 34.71 ...
#> 1.7 1 Ikungi MULTIPOLYGON (((34.20477 -4...
#> 1.8 1 Iramba POLYGON ((34.27165 -4.08353...
#> 1.9 1 Mkalama POLYGON ((34.75013 -3.83462...
rbcols <- rainbow(n = nrow(pol_int_dists))
par(mar = rep(0, 4))
pol_int_dists %>%
st_geometry %>%
plot(col = rbcols)Notice how the object contains three polygons and one multipolygon,
which is the blue district (Mufumbwe), which needs to be MULTIPOLYGON
because the light green district (Kaoma) drives a wedge between it,
making it two pieces. Let’s run st_cast:
# Chunk 37
pol_int_dists %>% st_cast
#> Simple feature collection with 12 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 34.2 ymin: -4.9 xmax: 35.6 ymax: -3.8
#> Geodetic CRS: WGS 84
#> First 10 features:
#> ID distName .
#> 1 1 Karatu MULTIPOLYGON (((35.04069 -3...
#> 1.1 1 Chemba MULTIPOLYGON (((35.33888 -4...
#> 1.2 1 Kondoa MULTIPOLYGON (((35.6 -4.9, ...
#> 1.3 1 Babati MULTIPOLYGON (((35.6 -4.1, ...
#> 1.4 1 Hanang MULTIPOLYGON (((35.6 -4.696...
#> 1.5 1 Mbulu MULTIPOLYGON (((35.6 -3.9, ...
#> 1.6 1 Meatu MULTIPOLYGON (((34.3 -3.8, ...
#> 1.7 1 Ikungi MULTIPOLYGON (((34.20477 -4...
#> 1.8 1 Iramba MULTIPOLYGON (((34.27165 -4...
#> 1.9 1 Mkalama MULTIPOLYGON (((34.75013 -3...That converts all POLYGONS to MULTIPOLYGONs, the simplest common geometry type that can be shared by all 4 geometry features. One can also specify the type of geometry you want it to be:
# Chunk 38
pol_int_dists %>% st_cast("POLYGON")
#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
#> Simple feature collection with 12 features and 2 fields
#> Geometry type: POLYGON
#> Dimension: XY
#> Bounding box: xmin: 34.2 ymin: -4.9 xmax: 35.6 ymax: -3.8
#> Geodetic CRS: WGS 84
#> First 10 features:
#> ID distName .
#> 1 1 Karatu POLYGON ((35.04069 -3.85697...
#> 1.1 1 Chemba POLYGON ((35.33888 -4.88134...
#> 1.2 1 Kondoa POLYGON ((35.6 -4.9, 35.433...
#> 1.3 1 Babati POLYGON ((35.6 -4.1, 35.6 -...
#> 1.4 1 Hanang POLYGON ((35.6 -4.696429, 3...
#> 1.5 1 Mbulu POLYGON ((35.6 -3.9, 35.6 -...
#> 1.6 1 Meatu POLYGON ((34.3 -3.8, 34.71 ...
#> 1.7 1 Ikungi POLYGON ((34.20477 -4.75227...
#> 1.8 1 Iramba POLYGON ((34.27165 -4.08353...
#> 1.9 1 Mkalama POLYGON ((34.75013 -3.83462...It throws a warning, because Mufumbwe can’t be reduced to a POLYGON without jettisoning one of two pieces, so it let’s us know that it had to do that. We can see the result:
# Chunk 39
par(mfrow = c(1, 3), mar = c(0, 0, 1, 0))
pol_int_dists %>%
st_geometry %>%
plot(col = rbcols, main = "Uncast")
pol_int_dists %>%
st_cast() %>%
st_geometry %>%
plot(col = rbcols, main = "Cast to MULTIPOLYGON")
pol_int_dists %>%
st_cast("POLYGON") %>%
st_geometry %>%
plot(col = rbcols, main = "Cast to POLYGON")
#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part only
#> Warning in st_cast.MULTIPOLYGON(X[[i]], ...): polygon from first part onlyNo difference between the original pol_int_dists and the
straight cast of it, but when we try cast to POLYGON, the bottom part of
the blue district disappears.
Plotting tip: note how we add titles to the plots and also tweak the margins to add space to the top margin so that those titles can be displayed.
We can also cast the polygons to lines, and even to points:
# Chunk 40
par(mfrow = c(1, 2), mar = c(0, 0, 1, 0))
pol_int_dists %>%
st_cast("MULTILINESTRING") %>%
st_geometry %>%
plot(col = rbcols, main = "Cast to MULTILINESTRING")
pol_int_dists %>%
st_cast("MULTIPOINT") %>%
st_geometry %>%
plot(col = rbcols, main = "Cast to MULTIPOINT", pch = 21, cex = 0.5)#> null device
#> 1
So that’s st_cast. Not really a spatial operation
per se, but can be important for transitioning between spatial
operations, as indicated by this passage from sf’s 3rd
vignette:
When functions return objects with mixed geometry type (GEOMETRY), downstream functions such as st_write may have difficulty handling them. For some of these cases, st_cast may help modifying their type. For sets of geometry objects (sfc) and simple feature sets (sf),st_cast` can be used by specifying the target type, or without specifying it.
To learn more st_cast and its behavior, please read that
vignette.
5.5 Differencing
Now we are going to move on to look at another bread and butter operation, differencing (or erasing).
# Chunk 41
d1 <- st_difference(x = districts, y = pol)
#> although coordinates are longitude/latitude, st_difference assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
d2 <- st_difference(x = pol, y = districts)
#> although coordinates are longitude/latitude, st_difference assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
par(mfrow = c(1, 2), mar = c(0, 0, 1, 0))
d1 %>%
st_geometry %>%
plot(col = "grey")
d2 %>%
st_geometry %>%
plot(col = "grey")d1 erases pol, leaving a hole, whereas
d2, by reversing the order of objects into the “x” and “y”
arguments, does the same thing as an intersection, by erasing all the
districts surrounding pol.
5.6 Unioning
We have already seen st_union in action to dissolve
internal polygon boundaries. Now let’s look at unioning two objects.
# Chunk 42
# #1
uni1 <- st_union(x = pol, y = districts)
#> although coordinates are longitude/latitude, st_union assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
uni1
#> Simple feature collection with 170 features and 2 fields
#> Geometry type: GEOMETRY
#> Dimension: XY
#> Bounding box: xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS: WGS 84
#> First 10 features:
#> ID distName .
#> 1 1 Arusha MULTIPOLYGON (((34.2 -4.8, ...
#> 1.1 1 Arusha Urban MULTIPOLYGON (((34.2 -4.8, ...
#> 1.2 1 Karatu POLYGON ((34.3 -3.8, 34.750...
#> 1.3 1 Longido MULTIPOLYGON (((34.2 -4.8, ...
#> 1.4 1 Meru MULTIPOLYGON (((34.2 -4.8, ...
#> 1.5 1 Monduli MULTIPOLYGON (((34.2 -4.8, ...
#> 1.6 1 Ngorongoro MULTIPOLYGON (((34.2 -4.8, ...
#> 1.7 1 Ilala MULTIPOLYGON (((34.2 -4.8, ...
#> 1.8 1 Kinondoni MULTIPOLYGON (((34.2 -4.8, ...
#> 1.9 1 Temeke MULTIPOLYGON (((34.2 -4.8, ...
#
# #2
uni2 <- st_union(x = districts, y = pol)
#> although coordinates are longitude/latitude, st_union assumes that they are planar
#> Warning: attribute variables are assumed to be spatially constant throughout all geometries
uni2
#> Simple feature collection with 170 features and 2 fields
#> Geometry type: GEOMETRY
#> Dimension: XY
#> Bounding box: xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS: WGS 84
#> First 10 features:
#> distName ID geometry
#> 1 Arusha 1 MULTIPOLYGON (((36.82231 -3...
#> 2 Arusha Urban 1 MULTIPOLYGON (((36.63247 -3...
#> 3 Karatu 1 POLYGON ((35.88634 -3.26032...
#> 4 Longido 1 MULTIPOLYGON (((36.35742 -3...
#> 5 Meru 1 MULTIPOLYGON (((36.89951 -3...
#> 6 Monduli 1 MULTIPOLYGON (((36.54092 -3...
#> 7 Ngorongoro 1 MULTIPOLYGON (((35.32059 -1...
#> 8 Ilala 1 MULTIPOLYGON (((39.21843 -6...
#> 9 Kinondoni 1 MULTIPOLYGON (((39.27007 -6...
#> 10 Temeke 1 MULTIPOLYGON (((39.3143 -6....
#
# #3
par(mfrow = c(1, 2), mar = c(0, 0, 1, 0))
plot(uni1[2], reset = FALSE, main = "Union of pol with districts")
plot(uni2[2], reset = FALSE, main = "Union of districts with pol")Two different order to unioning pol and
districts (#1 and #2). The ordering simply results in the
ordering of the attribute fields in the resulting unioned datasets (the
fields from the object in the “x” argument appear first) after the
overlapping polygons are merged. You can see the difference in the
ordering in the resulting plots (#3), in which for the first time we use
index notation to specify the field of each object that we want the plot
to render. In this case, we select the second field, which is different
in each unioned object, hence the different fill patterns. This
index-based referencing is a good way to avoid having to use
st_geometry, while letting sf::plot do the
work of selecting a color scheme.
5.7 Buffering
This is a fairly common GIS operation that sf handles
nicely.
# Chunk 43
#
# #1
species_alb <- species %>% st_transform(st_crs(roads))
long_roads <- roads %>% filter(as.numeric(st_length(.)) > 50000)
districts_alb <- districts %>% st_transform(st_crs(roads))
# #2
buffs <- lapply(c(10000, 20000, 30000), function(x) {
set.seed(123)
#2a
buf_species <- species_alb %>% filter(species == "Syncerus caffer") %>%
st_sample(size = 5) %>% st_buffer(dist = x) # here is the buffer
#2b
buf_roads <- long_roads %>% st_buffer(dist = x) # here is the buffer
#2c
list("species" = buf_species, "roads" = buf_roads)
})
#
# using purrr::map
# buffs <- c(10000, 20000, 30000) %>% map(function(x) {
# set.seed(123)
# #2a
# buf_species <- species_alb %>% filter(species == "Syncerus caffer") %>%
# st_sample(size = 5) %>% st_buffer(dist = x)
# #2b
# buf_roads <- long_roads %>% st_buffer(dist = x)
# list("species" = buf_species, "roads" = buf_roads)
# })
#
# #3
par(mar = c(0, 0, 0, 0))
cols <- c("cyan", "blue2", "orange") # 2
cols2 <- c("purple", "green4", "antiquewhite")
districts_alb %>% st_union %>% plot(col = "grey")
for(i in length(buffs):1) { # 4
plot(buffs[[i]]$roads, add = TRUE, col = cols2[i]) # 6
plot(buffs[[i]]$species, add = TRUE, col = cols[i]) # 5
}A fair bit of code up there, so let’s walk through it. In #1 we
convert species and districts to Albers
projections, as buffers should be specified in meters, not decimal
degrees, which one would have to do if buffering on a GCS project. We
also extract from roads the segments that are longer than
50 km. In #2, we then insert our buffering step into an
lapply, as we apply three different buffer widths (10, 20,
and 30 km), specified in meters (the unit of the Albers projection) into
st_buffer. We first buffer around 5 randomly selected
species from the second season, using st_sample instead of
sample_n (which we did previously), because it allows
merging of buffers where they touch, whereas a sample drawn with
sample_n would apply a separate buffer to each point. We
then apply st_buffer to the selected roads. The commented
out chunk of code shows how we can use purrr::map to do the
same work as lapply, using piping to pass the iteration
vector into the map function (you can also write it that
way for lapply).
In #3 we draw the plots, setting up an outline of Tanzania by
dissolving district boundaries, then plotting the buffered shapes in a
for loop (note the reversed index ordering, to plot from
largest to smallest buffers).
5.8 Sampling
We have seen a little bit of sampling already, but the previous examples just sample from the features in an existing object. Sometimes you might be interested in confining the sample to a particular feature or features, for example:
# Chunk 44
par(mar = rep(0, 4))
districts %>%
st_union %>%
plot(col = "grey")
set.seed(1)
districts %>%
st_union %>%
st_sample(size = 100, exact = TRUE) %>%
plot(pch = 20, add = TRUE)That uses st_sample to pull a random sample of 100
points from within the merged district. We set the
exact = TRUE argument to force the returned sample to be
the exact number we specify (otherwise it can sometimes be a bit
less).
Let’s look at this in a bit more depth:
# Chunk 45
# #1
seed <- 40
set.seed(seed)
dists_7 <- districts %>% sample_n(7)
#
# #2
set.seed(seed)
samp14 <- dists_7 %>% st_sample(size = 14, exact = TRUE)
#
# #3
set.seed(seed)
samp7_2 <- dists_7 %>% st_sample(size = rep(2, nrow(dists_7)), exact = TRUE)
par(mar = rep(0, 4))
districts[2] %>% plot(col = "grey", reset = FALSE)
dists_7 %>% plot(col = "khaki", add = TRUE)
samp14 %>% plot(col = "red", pch = 20, add = TRUE)
samp7_2 %>% plot(col = "blue", pch = "+", add = TRUE)In #1 we start by randomly selecting 7 districts from
districts (using sample_n). We then use
st_sample to randomly select 14 points from within those 7
randomly selected districts (#2). In #3, we specify that we want to
randomly select 2 points within each of the 7 districts–to do that, we
create a vector containing 2 repeated as many times as there are rows in
dists_7, so the sample size is specified for each of the 7
sampled districts. The output plots show the differences between the two
sampling strategies. The approach used in #2 distributes an unequal
number of points across the 7 districts, and misses one district
entirely. The approach in #3 places two points in each of the 7
districts. It is not hard to see that one could go from here to create a
stratified random sample in which the size was made proportional to each
district’s area.
So that’s it for this unit, which has given a brief and by no means exhaustive introduction to do some common vector operations and ways in which vector data are manipulated.
5.9 Practice
5.9.1 Questions
Why it is the purpose of
st_cast, and why might we need to cast features?What spatial operation occurs when
summarizeis applied to ansfobject?In what way does the order of unioning affect the result of
st_unionon two spatial objects?
5.9.2 Code
- Using the code in Chunk 32 as your template, create a rectangular
sfpolygon calledpol2using the x coordinates 35 and 35.5 (minimum and maximum) and y coordinates -4 and -4.5. Coordinates vectors should be ordered like this for x: xmin, xmax, xmax, xmin, xmin. And for y: ymin, ymin, ymax, ymax, ymin. In both cases, think of min and max in absolute terms (i.e. drop the negative sign). Plot the result on top ofdistrictsin blue. It should look like this:
Use your new
pol2to do anst_intersectionwithdistricts. Call itpol2_int_dists. Plot it using rainbow colors on top of greydistricts.Calculate the difference between 1) the sum of areas calculated from
pol2_int_distsand 2) the area ofpol2.Use
st_differenceto create apol2sized hole indistricts. Do the difference on the fly (i.e. don’t save it), and pipe the result toplot(col = "grey")From Chunk 43, recreate
species_alb, and use the code in #2a to recreate buffers of 30 km width around the 5 randomly selected species points, and plot those. Recreate the buffers, but usesample_nto do the random draw. Plot those and compare the difference.Select from
roadsthe roads that are greater than 40 km long, and save the result inroads_gt40. Place a 25 km buffer around those roads. Plot the result over the Albers-projected districts (in grey) in green.Select a random sample of exactly 20 points within each of the 25 km buffers around
roads_gt40. Plot the resulting points as red circles over the tan buffered roads and grey districts.
6 Unit assignment
6.1 Set-up
Make sure you are working in the main branch of your project (you should not be working on the a3 branch). Create a new vignette named “unit2_module1.Rmd”. You will use this to document the tasks you undertake for this assignment.
There are no package functions to create for this assignment. All work is to be done in the vignette.
6.2 Vignette tasks
Read in the
species.csv,districts.geojson, androads.geojsondatasets. Convert it to ansfobject. Reproject the species and districts data to Albers projection (using the CRS from roads), naming eachspecies_albanddistricts_alb. Ideally you will do all the necessary steps to createspecies_albanddistricts_albin one pipeline (worth an extra 0.5 points).Create a plot using
sf::plotthat shows all three datasets on one map, withdistricts_albin grey, withroadsin red over that, andspecies_albas a blue cross over that. Use the relevant chunk arguments to center the figure in the vignette html, and to have a height of 4 inches and a width of 6 inches (Hints:fig.align,fig.height,fig.width). The figure should have 0 margins all around (Hint:par).Make the same plot above using
ggplotandgeom_sf. When addingspecies_albto the plot, usepch = "+"andsize = 3as arguments togeom_sf. Add the functiontheme_bw()to theggplotconstruction chain, to get rid of the grey background. Make the “fill” (rather than “color”) ofdistricts_albgrey. Center the figure using chunk options and make the figure width 5 inches and height 6 inches.Select from
districts_albthe district representing the 50th percentile area, i.e. the median area, and save that district into a new objectmedian_dist. Plot it in “khaki” on top of greydistricts_alb. Give the plot a title “The median area district”. Same plot dimensions in the vignette html as for Task 2, but a leave a space of 1 at the top in the plotmar.Convert the
median_distto its centroid point. Call itmedian_distp.filterthespecies_albdata for species belonging to family “Felidae”, and then find the 20 closest Felidae species tomedian_distp. To do that, create the new objectclosest_20speciesby usingmutatewithst_distanceto create a new variable dist (convert it to numeric), and thenarrangeby variable dist and slice out the top 20 observations. Useggplot2to plotdistricts_albin grey,median_distover that in khaki,median_distpas a solid purple circle,species_albin blue, andclosest_20speciesin red. Zero margins (Hint: usetheme(plot.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "cm"))in theggplot2pipeline) and width of 6 inches and height of 4 inches.Create a rectangular
sfpolygon calledmypolusing the x coordinates 32 and 33 (minimum and maximum) and y coordinates -5 and -6. Assign itcrs = 4326and transform it to Albers. Select fromdistricts_albthe districts that intersectmypol, and plot in “grey40” overdistricts_albin grey, and plot over thatmypolwithout any fill but just a yellow border. Zero margins and width of 6 inches and height of 4 inches. Calculate the area in ha ofmypoland report it in your vignette below this plot.Create
mypol_dist_intfrom the intersection ofmypolanddistricts_alb, recasting the intersected districts to multipolygons, and adding an area variable onto it that reports areas of intersections in hectares. Do all that in one pipeline. Usesf::plotto plotmypol_dist_intin rainbow colors overdistricts_albin grey. Zero margins and width of 6 inches and height of 4 inches. Report the mean and median of intersections in the vignette below the plot.Find the shortest and longest road segments in Tanzania, and place the selected roads into a new object (
roads_extreme). To do this, you will need toarrangeroadsby length and thensliceto get the first and last observations (of course you need to first calculate length). Do that as one pipeline. Then calculate a 50 km buffer around those two roads (roads_extreme_buff). Usesf::plotto plotroads_extreme_buffin blue overdistricts_albin grey, and addroads_extremeon top of that as red lines (uselwd = 3in the plot). Zero margins and width of 6 inches and height of 4 inches.Select a random sample of 10 points in the smallest object, and one of 50 in the largest object in
roads_extreme_buff. Use a single call tost_sampleto do that (Hint: check the help of functionst_sample). Use a seed of 2. Usesf::plotto plot those points as yellow solid points over the same map created in Task 8 above. Use the same dimensions.Your final task is to intersect
roadswith the buffer of the longest road inroads_extreme_buff(name the result asroads_int). Plot the buffer of the longest road in blue over the districts in grey, and thenroads_intas red lines. Use the same dimensions as the previous two plots. Report the total distance of intersected roads in km in the vignette below the plot.
6.3 Assignment output
Refer to Unit 1 Module 4’s assignment output section for submission instructions. The only differences are as follows:
You should increment your package version number by 1 on the lowest number (e.g. from 0.1.2 to 0.1.3) in the DESCRIPTION;
When asked to report the result of calculations in the vignette text (e.g. in Task 10), you can use another RMarkdown trick, which is to let the calculation be done by code inserted between a single pair of backticks, with the first backtick followed immediately by r and a space. See here for more explanation of that.
Revise the setting chunk in the beginning of your vignette to:
to avoid unnecessary warning and message.
Update vignette title and this line in the YAML header to
%\VignetteIndexEntry{Unit 2 Module 1 assignment 4}. When you browse the vignettes of your package, you should see a page in your browser like this:
- Your submission should be on a new side branch “a4”.